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^ Abstract 

^J We propose a method for nonparametric density estimation that exhibits robustness 

^! to contamination of the training sample. This method achieves robustness by combining 

a traditional kernel density estimator (KDE) with ideas from classical M-estimation. 

We interpret the KDE based on a radial, positive semi-definite kernel as a sample mean 

in the associated reproducing kernel Hilbert space. Since the sample mean is sensitive 

p^ to outliers, we estimate it robustly via M-estimation, yielding a robust kernel density 

> estimator (RKDE). 



C/2 



^f-\ An RKDE can be computed efficiently via a kernelized iteratively re-weighted least 



cn 



X 

cd 



squares (IRWLS) algorithm. Necessary and sufficient conditions are given for kernelized 
IRWLS to converge to the global minimizer of the M-estimator objective function. The 



C^ robustness of the RKDE is demonstrated with a representer theorem, the influence 



function, and experimental results for density estimation and anomaly detection. 

Keywords: outlier, reproducing kernel feature space, kernel trick, influence function, M- 
estimation 

1 Introduction 

The kernel density estimator (KDE) is a well-known nonparametric estimator of univariate 
or multivariate densities, and numerous articles have been written on its properties, appli- 



cations, and extensions (Silverman, 1986 Scott, 1992). However, relatively little work has 



been done to understand or improve the KDE in situations where the training sample is 
contaminated. This paper addresses a method of nonparametric density estimation that 



generalizes the KDE, and exhibits robustness to contamination of the training sample. [] 
Consider training data following a contamination model 



lid 



Xi,...,X„ ~ (1 -p)fo+pfi 



(1) 



where /o is the "nominal" density to be estimated, /i is the density of the contaminating 
distribution, and p < 2 is the proportion of contamination. Labels are not available, so that 
the problem is unsupervised. The objective is to estimate /o while making no parametric 
assumptions about the nominal or contaminating distributions. 

Clearly /o cannot be recovered if there are no assumptions on /q, /i and p. Instead, we 
will focus on a set of nonparametric conditions that are reasonable in many practical appli- 
cations. In particular, we will assume that, relative to the nominal data, the contaminated 
data are 

(a) outlying: the densities /o and /i have relatively little overlap 

(b) diffuse: /i is not too spatially concentrated relative to /o 

(c) not abundant: a minority of the data come from /i 

Although we will not be stating these conditions more precisely, they capture the intuition 
behind the quantitative results presented below. 

As a motivating application, consider anomaly detection in a computer network. Imagine 
that several multi-dimensional measurements Xi, . . . , X„ are collected. For example, each 
Xj may record the volume of traffic along certain links in the network, at a certain instant 



in time (Chhabra et al. , 2008). If each measurement is collected when the network is in a 



nominal state, these data could be used to construct an anomaly detector by first estimating 
the density /o of nominal measurements, and then thresholding that estimate at some level 
to obtain decision regions. Unfortunately, it is often difficult to know that the data are 
free of anomalies, because assigning labels (nominal vs. anomalous) can be a tedious, labor 
intensive task. Hence, it is necessary to estimate the nominal density (or a level set thereof) 
from contaminated data. Furthermore, the distributions of both nominal and anomalous 
measurements are potentially complex, and it is therefore desirable to avoid parametric 
models. 

The proposed method achieves robustness by combining a traditional kernel density 



estimator with ideas from M-estimation (Huber, 1964; Hampel, 1974). The KDE based 



Shorter versions of this work previously appeared at the International Conference on Acoustics, Speech, 



and Signal Processing ( Kim & Scott 2008 \ and the International Conference on Machine Learning ( Kim & 



Scott 2011 1 



on a radial, positive semi-definite (PSD) kernel is interpreted as a sample mean in the 
reproducing kernel Hilbert space (RKHS) associated with the kernel. Since the sample 
mean is sensitive to outliers, we estimate it robustly via M-estimation, yielding a robust 
kernel density estimator (RKDE). We describe a kernelized iteratively re- weighted least 
squares (KIRWLS) algorithm to efficiently compute the RKDE, and provide necessary and 
sufficient conditions for the convergence of KIRWLS to the RKDE. 

We also offer three arguments to support the claim that the RKDE robustly estimates 
the nominal density and its level sets. First, we characterize the RKDE by a representer 
theorem. This theorem shows that the RKDE is a weighted KDE, and the weights are 
smaller for more outlying data points. Second, we study the influence function of the 
RKDE, and show through an exact formula and numerical results that the RKDE is less 
sensitive to contamination by outliers than the KDE. Third, we conduct experiments on 
several benchmark datasets that demonstrate the improved performance of the RKDE, 
relative to competing methods, at both density estimation and anomaly detection. 

One motivation for this work is that the traditional kernel density estimator is well- 
known to be sensitive to outliers. Even without contamination, the standard KDE tends to 
overestimate the density in regions where the true density is low. This has motivated several 
authors to consider variable kernel density estimators (VKDEs), which employ a data- 
dependent bandwidth at each data point (Breiman et aH 1977[ [Abramson 1982 Terrell & 



Scott, 1992). This bandwidth is adapted to be larger where the data are less dense, with 



the aim of decreasing the aforementioned bias. Such methods have been applied in outlier 



detection and computer vision applications (Comaniciu et al. 2001 Latecki et al. , 2007), 



and are one possible approach to robust nonparametric density estimation. We compare 
against these methods in our experimental study. 

Density estimation with positive semi-definite kernels has been studied by several au- 



thors. Vapnik & Mukherjee (2000) optimize a criterion based on the empirical cumulative 



distribution function over the class of weighted KDEs based on a PSD kernel. Shawe- Taylor] 



Sz Doha (2007) provide a refined theoretical treatment of this approach. Song et al. (2008) 



adopt a different criterion based on Hilbert space embeddings of probability distributions. 
Our approach is somewhat similar in that we attempt to match the mean of the empiri- 
cal distribution in the RKHS, but our criterion is different. These methods were also not 
designed with contaminated data in mind. 

We show that the standard kernel density estimator can be viewed as the solution to 
a certain least squares problem in the RKHS. The use of quadratic criteria in density 
estimation has also been previously developed. The aforementioned work of Song et al. 



optimizes the norm-squared in Hilbert space, whereas Kim (1995); Girolami &; He (2003); 



Kim Sz Scott (2010); Mahapatruni &; Gray (2011) adopt the integrated squared error. Once 



again, these methods are not designed for contaminated data. 

Previous work combining robust estimation and kernel methods has focused primarily on 
supervised learning problems. M-estimation applied to kernel regression has been studied by 
various authors (Christmann &: Steinwart, 2007; Debruyne et al. , 2008a|b Zhu et al. , 2008 



Wibowo , 2009 ; Brabanter et al. 2009 ) . Robust surrogate losses for kernel-based classifiers 



have also been studied (Xu et al. , 2006). In unsupervised learning, a robust way of doing 



kernel principal component analysis, called spherical KPCA, has been proposed, which 
applies PCA to feature vectors projected onto a unit sphere around the spatial median in a 



kernel feature space (Debruyne et al. , 2010). The kernelized spatial depth was also proposed 



to estimate depth contours nonparametrically (Chen et al. , 2009). To our knowledge, the 



RKDE is the first application of M-estimation ideas in kernel density estimation. 

In Section [2] we propose robust kernel density estimation. In Section [3] we present a 
representer theorem for the RKDE. In Section |4] we describe the KIRWLS algorithm and 
its convergence. The influence function is developed in Section [Sj and experimental results 
are reported in Section [6j Conclusions are offered in Section [7} Section [8] contains proofs of 
theorems. Matlab code implementing our algorithm is available at |Www . eecs . umich . edu/ 
|~cscottj 



2 Robust Kernel Density Estimation 

Let Xi, . . . ,X„ G R be a random sample from a distribution F with a density /. The 
kernel density estimate of /, also called the Parzen window estimate, is a nonparametric 
estimate given by 



Ikde (x) 



4 = 1 



"'(T (.X, A.J 



where k^ is a kernel function with bandwidth a. To ensure that fKDEi^) is a density, 
we assume the kernel function satisfies kcr{- , ■) > and J fco- (x, • ) dx = 1 . We will also 
assume that fco-(x, x') is radial, in that ^^(x, x') = g{\\:x. — x'|p) for some g. 

In addition, we require that ka be positive semi-definite, which means that the matrix 
{ka{xi,Xj))i<ij<rn is positive semi-definite for all positive integers m and all xi, . . . ,Xm £ 
M"^. For radial kernels, this is equivalent to the condition that g is completely monotone. 



I.e., 



{-^)''-^9{t)>0, forallfc>l,t>0, 



dt'^' 



limg{t)=g{0), 
t— s>0 



and to the assumption that there exists a finite Borel measure ^ on M+ = [0, oo) such that 

/co-(x,x')= / exp(— t ||x — x'll )(i/z(t). 



See Scovel et al. (2010). Well-known examples of kernels satisfying all of the above properties 



are the Gaussian kernel 



I \'^ / ||x-x'|P~ 
fc.(x,x')=(^^J exp^ ^-^), (2) 



the multivariate Student kernel 



Mx x') - ( ^ V • ^(^" + '^/'^ . f 1 + ^ II" -"T 



iz + d 
2 



and the Laplacian kernel 

ka{x,x') = ^exp 



Cd I ||x-x'| 



a"' \ a 



where c^ is a constant depending on the dimension d that ensures J ku (x, • ) dx = 1. The 
PSD assumption does, however, exclude several common kernels for density estimation, 
including those with finite support. 

It is possible to associate every PSD kernel with a feature map and a Hilbert space. Al- 
though there are many ways to do this, we will consider the following canonical construction. 
Define '&(x) = ka{-,x.), which is called the canonical feature map associated with k„. Then 
define the Hilbert space of functions T-L to be the completion of the span of {<l?(x) : x G M }. 
This space is known as the reproducing kernel Hilbert space (RKHS) associated with k^- 



See Steinwart Sz Christmann (2008) for a thorough treatment of PSD kernels and RKHSs. 
For our purposes, the critical property of T-L is the so-called reproducing property. It states 
that for all g € Ti and all x € MJ^, g{x.) = {^{x^),g)-}{. As a special case, taking g = k^{-,:x.'), 
we obtain 

A;(x,xO = (<l>(x),<I>(x')) 

for all X, x' G M'*. Therefore, the kernel evaluates the inner product of its arguments after 
they have been transformed by ^. 



For radial kernels, ||$(x)||-^ is constant since 

\m^)fn = (1>(x),cl>(x))^ = A:,(x,x) = A:^(0,0). 

We will denote r = ||$(x)||^. 

From this point of view, the KDE can be expressed as 



1 " 
fKDE(-) = -VA;a(-,Xj) 
n ^-^ 

i=\ 
n 



the sample mean of the $(Xi)'s. Equivalently, /kde G "H is the solution of 

n 

4 = 1 

Being the solution of a least squares problem, the KDE is sensitive to the presence of 
outliers among the $(Xj)'s. To reduce the effect of outliers, we propose to use M-estimation 



(Huber, 1964) to find a robust sample mean of the <l>(Xj)'s. For a robust loss function p{x) 



on a: > 0, the robust kernel density estimate is defined as 



Irkde = argminV'p(||$(Xi) - g\\n)- 



(3) 



Well-known examples of robust loss functions are Huber's or Hampel's p. Unlike the 
quadratic loss, these loss functions have the property that ip = p' is bounded. For Hu- 
ber's p, ip is given by 

X, < X < a 



ip (x) 



(4) 



a, a < X. 



and for Hampel's p. 



V'(x) 



(5) 



X, < X < a 

o, a < X < b 

a ■ {c — x)/{c — 6), b < X < c 

0, c < X. 

The functions p{x),ip{x), and ip{x)/x are plotted in Figure [I| for the quadratic, Huber, and 
Hampel losses. Note that while 'ip{x)/x is constant for the quadratic loss, for Huber's or 
Hampel's loss, this function is decreasing in x. This is a desirable property for a robust 
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Figure 1: The comparison between three different p{x), ip{x), and il){x)/x: quadratic, 
Ruber's, and Hampel's. 



loss function, which will be explained later in detail. While our examples and experiments 
employ Ruber's and Hampel's losses, many other losses can be employed. 

We will argue below that /rkde is a valid density, having the form X^ILi ^^^cC''-^*) 
with weights Wi that are nonnegative and sum to one. To illustrate the estimator, Figure [2] 
(a) shows a contour plot of a Gaussian mixture distribution on M? . Figure p\ (b) depicts a 
contour plot of a KDE based on a training sample of size 200 from the Gaussian mixture. 
As we can see in Figure [2] (c) and (d), when 20 contaminating data points are added, the 
KDE is significantly altered in low density regions, while the RKDE is much less affected. 





(a) True density 



(b) KDE without outliers 





(c) KDE with outhers 



(d) RKDE with outhers 



Figure 2: Contours of a nominal density and kernel density estimates along with data 
samples from the nominal density (o) and contaminating density (x). 200 points are from 
the nominal distribution and 20 contaminating points are from a uniform distribution. 

Throughout this paper, we define ^^ix) = iIj{x)/x and consider the following assumptions 
on p, tp, and 93: 

(Al) p is non-decreasing, p(0) = 0, and p{x)/x — )• as x — )• 
(A2) (/^(O) = lim3;^o ^4r^ exists and is finite 



(A3) V si^d '^ ^re continuous 



(A4) -0 and ip are bounded 

(A5) (/9 is Lipschitz continuous 

which hold for Ruber's and Hampel's losses, as well as several others. 

3 Represent er Theorem 

In this section, we will describe how fuKDEi^) can be expressed as a weighted combination 
of the A;o-(x, Xj)'s. A formula for the weights explains how a robust sample mean in T-L 
translates to a robust nonparametric density estimate. We also present necessary and 
sufficient conditions for a function to be an RKDE. From (3), /rkde = argmin^g-^ Jid), 

where 

1 " 
Ji9) = -Y.P(\\'^(^^)-9\\n)- (6) 

1=1 

First, let us find necessary conditions for (/ to be a minimizer of J. Since the space over 
which we are optimizing J is a Hilbert space, the necessary conditions are characterized 
through Gateaux differentials of J. Given a vector space X and a function T : Af — )• M, the 
Gateaux differential of T at x £ X with incremental h £ X is defined as 

^r(x;/^)=lim ^(" + "^^-^(^^ 

If 6T{xo; h) is defined for all h £ X, a necessary condition for T to have a minimum at xq 



is that 6T{xq; h) = for all h £ X (Luenberger, 1997). From this optimality principle, we 
have the following lemma. 

Lemma 1. Suppose assumptions (Al) and (A2) are satisfied. Then the Gateaux differential 
of J at g £% with incremental h £ Ti is 

SJ{g;h) = -{V{g),h)^ 

where V :% ^H. is given by 

1 " 

^(5) = n^ '^(ll^(X^) -9\\n)- {H^^) - g) ■ 

i=l 

A necessary condition for g = fuKDE is V{g) = 0. 

Lemma [T] is used to establish the following representer theorem, so named because 
fuKDE can be represented as a weighted combination of kernels centered at the data points. 



Similar results are known for supervised kernel methods (Scholkopf et al. , 2001). 



Theorem 1. Suppose assumptions (Al) and (A2) are satisfied. Then, 

n 
fRKDEix) = '^Wik^{x,Xi) (7) 

i=l 

where Wi > 0, ^"=i Wi = 1. Furthermore, 

Wi(X^{\\<^(Xi)-fRKDE\\n)- (8) 

It follows that Jrkde is a density. The representer theorem also gives the following 
interpretation of the RKDE. If c/? is decreasing, as is the case for a robust loss, then Wi will 
be small when ||<I>(Xj) — fRKDEWv. is large. Now for any g £ 71, 

||$(X,) - g\\l^ = {^{Xi) - g, c|>(X,) - g)n 

= \mx,)\\l-2mx,),g)n + \\gfn 

|2 

In- 



T^ - 2giXi) + \\gf 



Taking g = /rkde, we see that Wi is small when fRKDEp^i) is small. Therefore, the RKDE 
is robust in the sense that it down-weights outlying points. 

Theorem 111 provides a necessary condition for /rkde to be the minimizer of ([6|. With 
an additional assumption on J, this condition is also sufficient. 

Theorem 2. Suppose that assumptions (Al) and (A2) are satisfied, and J is strictly convex. 
Then Vm, |^, and X^ILi ^i = 1 o,re sufficient for Jrkde to be the minimizer of !(m. 

Since the previous result assumes J is strictly convex, we give some simple conditions 
that imply this property. 

Lemma 2. J is strictly convex provided either of the following conditions is satisfied: 

(i) p is strictly convex and non- decreasing. 

(a) p is convex, strictly increasing, n > 3, and K = (A;o-(Xj,Xj))['- -j^ is positive definite. 

The second condition implies that J can be strictly convex even for the Huber loss, 
which is convex but not strictly convex. 

4 KIRWLS Algorithm and Its Convergence 



In general, ([3j) does not have a closed form solution and fRKDE has to be found by an iter- 
ative algorithm. Fortunately, the iteratively re-weighted least squares (IRWLS) algorithm 

10 



used in classical M-estimation (Huber, 1964) can be extended to a RKHS using the kernel 



trick. The kernelized iteratively re- weighted least squares (KIRWLS) algorithm starts with 
initial w^ G M , i = 1, . . . ,n such that w\ > and Y17=i ^i ~ 1' ^^'^ generates a 
sequence {f^'} by iterating on the following procedure: 

n 



/W = ^^f-i).,(X,), 



i=l 



(fe)_ v.(||$(X,)-/W||^) 

* E-=i^(ll<^(x,) -/('=) IIh)' 

Intuitively, this procedure is seeking a fixed point of equations (JTl) and ([S]). The computation 
of [[^(Xj) — f^'^'W'H can be done by observing 



$(x,) - f^'Ml'u = i^i^j) - f'^'K ^(x,) - /(') 



w 



($(X,),<i>(X,)L-2<$(X,),/('=)L + (/('),/('=)> 



'-«• 



Since /(^) = ELi ^f ~ ^(Xi), we have 



(cI>(X,),<i>(X,); 

(«^(x,),/^'); 



n 



n 



(^fik)jik)^ 



n 



fccr(Xj,Xj) 
n 

n n 

EE 



(fc-i)^ (fc-i) 



'w; 



^a(Xi,X;). 



Recalling that $(x) = /C(j(-,x), after the kth iteration 

n 

(fc-1). 



/W(x) = ^^f-^^fc.(x,X,). 



4 = 1 

Therefore, KIRWLS produces a sequence of weighted KDEs. The computational complexity 
is 0{n?) per iteration. In our experience, the number of iterations needed is typically well 
below 100. Initialization is discussed in the experimental study below. 

KIRWLS can also be viewed as a kind of optimization transfer/majorize-minimize algo- 



rithm (Lange et al. 2000 Jacobson Sz Fessler, 2007) with a quadratic surrogate for p. This 



perspective is used in our analysis in Section 8.4, where f^' is seen to be the solution of a 
weighted least squares problem. 

The next theorem characterizes the convergence of KIRWLS in terms of {J{f^^')}'kLi 
and {/('=)}r=i. 



11 



Theorem 3. Suppose assumptions (Al) - (A3) are satisfied, and f{x) is nonincreasing. 
Let 

and {/''^'l^x ^^ ^^^ sequence produced by the KIRWLS algorithm. Then, J{f^^') mono- 
tonically decreases at every iteration and converges. Also, S ^ 9 and 



||/('=)-5||^^inf||/W-g||^^0 



as fc —7- oo. 



In words, as the number of iterations grows, f^^' becomes arbitrarily close to the set of 
stationary points of J, points g £ Ti satisfying 6J{g; h) = \/h £ Ti. 



convex. 



Corollary 1. Suppose that the assumptions in Theorem^ hold and J is strictly 
Then, {f^ '}^i converges to Jrkde in the T-L-norm. 

This follows because under strict convexity of J, |5| = 1. 

5 Influence Function for Robust KDE 

To quantify the robustness of the RKDE, we study the influence function. First, we recall 
the traditional influence function from robust statistics. Let T[F) be an estimator of a 
scalar parameter based on a distribution F. As a measure of robustness of T, the influence 



function was proposed by Hampel (1974). The influence function (IF) for T at i^ is defined 
as 

s^O S 

where Sx' represents a discrete distribution that assigns probability 1 to the point x' . Ba- 
sically, IF{x'; T, F) represents how T{F) changes when the distribution F is contaminated 
with infinitesimal probability mass at x' . One robustness measure of T is whether the 
corresponding IF is bounded or not. 

For example, the maximum likelihood estimator for the unknown mean 9 of Gaussian 
distribution is the sample mean T{F), 

T{F) = Ef[X] = I xdF{x). (9) 

The infiuence function for T{F) in ([9]) is 

s-s>0 S 

= x' -Ef[X]. 
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Since \IF{x';T,F)\ increases without bound as x' goes to ±00, the estimator is considered 
to be not robust. 

Now, consider a similar concept for a function estimate. Since the estimate is a function, 
not a scalar, we should be able to express the change of the function value at every x. 

Definition 1 (IF for function estimate). Let T{x;F) be a function estimate based on F, 
evaluated at x. We define the influence function for T(x; F) as 

s^O S 

where Fg = {1 — s)F + s5-x_i . 

IFlxjx'-jTjF) represents the change of the estimated function T at x when we add 
infinitesimal probability mass at x' to F. For example, the standard KDE is 



r(x; F) = Jkde{^; F) = j K{x, y)dF(y) 



where X ~ F. In this case, the influence function is 

TTp, I 7 T?\ r fKDE{x;Fs) - fKDEi^;F) 
IF{x, X ; fKDE, F) = hni 

s->0 S 

EpAK{x,X)] - EF[k^{x,X 



lim 

s^O S 

^. -sEF[k„{x,X)] + sEsJkaix,:S.)] 
lim 

s-^O S 

-EF[k^{x,X)] + EsJk^{x,X)] 

-SF[fca(x,X)]+fc,(x,xO (10) 



With the empirical distribution Fn = - X^^Li ^Xi, 

/F(x,x';/xDS,F„) = --VA;,(x,X,) + fc,(x,xO. (11) 

n ^ — * 



n 



To investigate the influence function of the RKDE, we generalize its definition to a 
general distribution ^u, writing fRKDEi • ; ^) = /^ where 



/^ = argmin / p(||^(x) - g\\n) dfi{x). 



gan 

For the robust KDE, T(x, F) = fRKDEi^', F) = (<I>(x), fF)'H-> we have the following charac- 
terization of the influence function. Let q{x) = xtp'{x) — ip{x). 

13 



Theorem 4. Suppose assumptions (Al)-(A5) are satisfied. In addition, assume that fp^, 



A i;^ _ fPs-f, 



fp as s — )• 0. If fp = lims-5>o % exists, then 

IF{^,^';fRKDE,F) = m^),fF)n 
where fp £ H satisfies 

j ^im-K) - fp\\H)dF\ ■ fp 

: ($(x')-/f)-V'(||$(x')-/f||«). (12) 



Unfortunately, for Huber or Hampel's p, there is no closed form solution for fp of (12). 
However, if we work with Fn instead of F, we can find fp^ explicitly. Let 

l = [l,...,lf, 
k' = [K{-K, Xi), . . . , ka{y^, X„)]^, 

In be the n x n identity matrix, K = {k„(X.i,'X.j))f^i .^^ be the kernel matrix, Q be a 
diagonal matrix with Qu = Q(||$(Xi) - /irJ|^)/||$(Xi) - /f,J||^, 

n 

J = Y,vi\m^^)-fFJn), 

i=l 

and 

W = [wi,...,Wn]'^, 

where w gives the RKDE weights as in ln\. 

Theorem 5. Suppose assumptions (Al)-(A5) are satisfied. In addition, assume that 

• fp^ ^ — )• fp^ as s ^- (satisfied when J is strictly convex) 

• the extended kernel matrix K' based on {Xj}"^-|^ Ul^'l ^^ positive definite. 

Then, 

n 

/F(x, x'; fRKDE, Fn) = ^ aik^{x, Xj) + ak^{x, x') 

where 

a' = n-ip{\Mx')-fpJn)/-f 
and a = [ai, . . . , a„] is the solution of the following system of linear equations: 

^In + {In-l- ^"^fQiln " 1 " W^)i^|a 

= - mpiim^') - fpjn)^ - a {In - 1 • w^f Q •(/„-!• w^) • k'. 
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Figure 3: (a) true density and density estimates, (b) IF as a function of x when x' = —5 

Note that a' captures the amount by which the density estimator changes near x' in 
response to contamination at x'. Now a' is given by 

^{\m^') - fpjn) 

^Er=i<^(ii^(Xi)-/Fji«)' 



For a standard KDE, we have (f = 1 and a' = 1, in agreement with (11). For robust 
p, c^(||$(x') — fFnWy.) can be viewed as a measure of "inlyingness" , with more inlying 
points having larger values. This follows from the discussion just after Theorem [l] If the 
contaminating point x' is less inlying than the average Xj, then a' < 1. Thus, the RKDE 
is less sensitive to outlying points than the KDE. 

As mentioned above, in classical robust statistics, the robustness of an estimator can be 
inferred from the boundedness of the corresponding influence function. However, the influ- 
ence functions for density estimators are bounded even if ||x'|| — )• oo. Therefore, when we 
compare the robustness of density estimates, we compare how close the influence functions 
are to the zero function. 

Simulation results are shown in Figure [3] for a synthetic univariate distribution. Figure 
[3] (a) shows the density of the distribution, and three estimates. Figure [s] (b) shows the 
corresponding influence functions. As we can see in (b), for a point x' in the tails of F, the 
influence functions for the robust KDEs are overall smaller, in absolute value, than those of 
the standard KDE (especially with Hampel's loss). Additional numerical results are given 
in Section [621 

Finally, it is interesting to note that for any density estimator /, 

/ /F(x. .'; /, F) dx = lim f f(-: "■) "- I fi-: ") -^ ^ o. 

J s^O S 
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Thus ol = — X^"=i Oil for a robust KDE. This suggests that since /rkde has a smaller 
increase at x' (compared to the KDE), it will also have a smaller decrease (in absolute 
value) near the training data. Therefore, the norm of /-F(x, x'; Jrkde, Fn) should be smaller 
overall when x' is an outlier. We confirm this in our experiments below. 

6 Experiments 



The experimental setup is described in 6.1 and results are presented in 6.2 



6.1 Experimental Setup 

Data, methods, and evaluation are now discussed. 

6.1.1 Data 

We conduct experiments on 15 benchmark data sets (Banana, B. Cancer, Diabetes, F. Solar, 
German, Heart, Image, Ringnorm, Splice, Thyroid, Twonorm, Waveform, Pima Indian, Iris, 
MNIST), which were originally used in the task of classification. The data sets are available 
online: see http://www.fml.tuebingen.mpg.de/Members/ for the first 12 data sets and the 
UCI machine learning repository for the last 3 data sets. There are 100 randomly permuted 
partitions of each data set into "training" and "test" sets (20 for Image, Splice, and MNIST). 
Given Xi, . . . , X„ ~ / = (1 — p) • /o + P " /i> our goal is to estimate /o, or the level sets 
of /q. For each data set with two classes, we take one class as the nominal data from /o 
and the other class as contamination from /i. For Iris, there are 3 classes and we take one 
class as nominal data and the other two as contamination. For MNIST, we choose to use 
digit as nominal and digit 1 as contamination. For MNIST, the original dimension 784 
is reduced to 8 via kernel PGA using a Gaussian kernel with bandwidth 30. For each data 
set, the training sample consists of uq nominal data and rii contaminating points, where 
?ii = e • no for e = 0, 0.05, 0.10, 0.15, 0.20, 0.25 and 0.30. Note that each e corresponds to 
an anomaly proportion p such that p = j^. no is always taken to be the full amount of 
training data for the nominal class. 

6.1.2 Methods 

In our experiments, we compare three density estimators: the standard kernel density 
estimator (KDE), variable kernel density estimator (VKDE), and robust kernel density 
estimator (RKDE) with Hampel's loss. For all methods, the Gaussian kernel in pi) is used 
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as the kernel function /cq- and the kernel bandwidth a is set as the median distance of a 
training point Xj to its nearest neighbor. 

The VKDE has a variable bandwidth for each data point, 



1 " 



and the bandwidth Oi is set as 



a 



i=\ 



V 



fKDEi^i) 



1/2 



where rj is the mean of {fKDEO^i)}^=i (Abramson 



1982 



Comaniciu et al. 



2001). There is 



another implementation of the VKDE where ai is based on the distance to its A:-th nearest 



neighbor (Breiman et al. , 1977). However, this version did not perform as well and is 



therefore omitted. 

For the RKDE, the parameters a, b, and c in ([5| are set as follows. First, we compute 
fmedi the RKDE based on p = | • |, and set di = ||<1> (Xj) — fmedWu- Then, a is set to be 
the median of {(!{}, b the 75th percentile of {(!{}, and c the 85th percentile of {di}. After 
finding these parameters, we initialize w- such that f'^^' = fmed and terminate KIRWLS 
when 



|j(/(fe+i))_j(/W)| 

j(/W) <'' • 

6.1.3 Evaluation 

We evaluate the performance of the three density estimators in three different settings. 
First, we use the influence function to study sensitivity to outliers. Second and third, we 
compare the methods at the tasks of density estimation and anomaly detection, respectively. 
In each case, an appropriate performance measure is adopted. These are explained in detail 



in Section 6.2 To compare a pair of methods across multiple data sets, we adopt the 



Wilcoxon signed-rank test (Wilcoxon, 1945). Given a performance measure, and given a 



pair of methods and e, we compute the difference hi between the performance of two density 
estimators on the ith data set. The data sets are ranked 1 through 15 according to their 
absolute values \hi\, with the largest \hi\ corresponding to the rank of 15. Let Ri be the 
sum of ranks over these data sets where method 1 beats method 2, and let i?2 be the sum 
of the ranks for the other data sets. The signed-rank test statistic T = in.m{Ri,R2) and the 
corresponding p-value are used to test whether the performances of the two methods are 
significantly different. For example, the critical value of T for the signed rank test is 25 at 
a significance level of 0.05. Thus, if T < 25, the two methods are significantly different at 
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method 1 


method 2 




a(x') 


/3(x') 






Ri 


120 


120 


RKDE 


KDE 


R2 

T 














p- value 


0.00 


0.00 



Table 1: The signed-rank statistics and p- values of the Wilcoxon signed-rank test using the 
medians of {a(x')} and {/3(x')} as a performance measure. If Ri is larger than R2, method 
1 is better than method 2. 

the given significance level, and the larger of Ri and R2 determines the method with better 
performance. 



6.2 Experimental Results 

We begin by studying influence functions. 

6.2.1 Sensitivity using influence function 

As the first measure of robustness, we compare the influence functions for KDEs and 
RKDEs, given in (11) and Theorem^ respectively. To our knowledge, there is no formula 
for the influence function of VKDEs, and therefore VKDEs are excluded in the comparison. 
We examine a(x') = IF(x.' ,x';T,Fn) and 

/3(x')=fy(/F(x,x';r,F„))'dx') . 

In words, a(x') reflects the change of the density estimate value at an added point x' and 
/3(x') is an overall impact of x' on the density estimate over M.'^. 

In this experiment, e is equal to 0, i.e, the density estimators are learned from a pure 
nominal sample. Then, we take contaminating points from the test sample, each of which 
serves as an x'. This gives us multiple a(x')'s and /3(x')'s. The performance measures are 
the medians of {a(x')} and {/3(x')} (smaller means better performance). The results using 
signed rank statistics are shown in Table [T] The results clearly states that for all data sets, 
RKDEs are less affected by outliers than KDEs. 
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6.2.2 Kullback-Leibler (KL) divergence 

Second, we present the Kullback-Leibler (KL) divergence between density estimates / and 

fo, _ 

DKL{f\\fo)= //(x)log^dx. 
J /o(x) 

This KL divergence is large whenever / estimates /o to have mass where it does not. 

The computation of Dkl is done as follows. Since we do not know the nominal /o, it is 
estimated as /o, a KDE based on a separate nominal sample, obtained from the test data 
for each benchmark data set. Then, the integral is approximated by the sample mean, i.e., 



n 



DKLif\\fo)^^log-^^'^^ 



i=l 



/o(xD 



where {x^}"^]^ is an i.i.d sample from the estimated density / with n' = 2n = 2(no + ni). 
Note that the estimated KL divergence can have an infinite value when /o(y) = (to 
machine precision) and /(y) > for some y G W^. The averaged KL divergence over the 
permutations are used as the performance measure (smaller means better performance). 
Table [2] summarizes the results. 

When comparing RKDEs and KDEs, the results show that KDEs have smaller KL 
divergence than RKDEs with e = 0. As e increases, however, RKDEs estimate /o more 
accurately than KDEs. The results also demonstrate that VKDEs are the worst in the 
sense of KL divergence. Note that VKDEs place a total mass of 1/n at all Xj, whereas the 
RKDE will place a mass Wi < 1/n at outlying points. 

6.2.3 Anomaly detection 

In this experiment, we apply the density estimators in anomaly detection problems. If we 
had a pure sample from /o, we would estimate /o and use {x : /o(x) > A} as a detector. 
For each A, we could get a false negative and false positive probability using test data. By 
varying A, we would then obtain a receiver operating characteristic (ROC) and area under 
the curve (AUG). However, since we have a contaminated sample, we have to estimate /o 
robustly. Robustness can be checked by comparing the AUG of the anomaly detectors, 
where the density estimates are based on the contaminated training data (higher AUG 
means better performance). 

Examples of the ROGs are shown in Figure |4j The RKDE provides better detection 
probabilities, especially at low false alarm rates. This results in higher AUG. For each pair 
of methods and each e, i?i, R2, T and p- values are shown in Table |3j The results indicate 
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method 1 


method 2 




e 


0.00 


0.05 


0.10 


0.15 


0.20 


0.25 


0.30 






Ri 


26 


67 


78 


83 


94 


101 


103 


RKDE 


KDE 


R2 
T 


94 
26 


53 
53 


42 
42 


37 
37 


26 
26 


19 
19 


17 
17 






p- value 


0.06 


0.72 


0.33 


0.21 


0.06 


0.02 


0.01 






Ri 


104 


117 


117 


117 


117 


119 


119 


RKDE 


VKDE 


R2 

T 


16 
16 


3 
3 


3 
3 


3 
3 


3 
3 


1 
1 


1 
1 






p- value 


0.01 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 






Ri 























VKDE 


KDE 


R2 
T 


120 



120 



120 



120 



120 



120 



120 







p- value 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 


0.00 



Table 2: The signed-rank statistics and p-values of the Wilcoxon signed-rank test using KL 
divergence as a performance measure. If Ri is larger than R2, method 1 is better than 
method 2. 



that RKDEs are significantly better than KDEs when e > 0.20 with significance level 0.05. 
RKDEs are also better than VKDEs when e > 0.15 but the difference is not significant. 



We also note that we have also evaluated the kernelized spatial depth (KSD) (Chen et al. 



2009) in this setting. While this method does not yield a density estimate, it does aim to 



estimate density contours robustly. We found that the KSD performs worse in terms of 



AUG that either the RKDE or KDE, so those results are omitted (Kim & Scott, 2011) 



7 Conclusions 

When kernel density estimators employ a smoothing kernel that is also a PSD kernel, 
they may be viewed as M-estimators in the RKHS associated with the kernel. While 
the traditional KDE corresponds to the quadratic loss, the RKDE employs a robust loss 
to achieve robustness to contamination of the training sample. The RKDE is a weighted 
kernel density estimate, where smaller weights are given to more outlying data points. These 
weights can be computed efficiently using a kernelized iteratively re-weighted least squares 
algorithm. The decreased sensitivity of RKDEs to contamination is further attested by 
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0.2 0.4 0.6 0.8 
false alarm 

(a) Banana, e = 0.2 




0.4 0.6 
false alarm 



(b) Iris, e = 0.1 



Figure 4: Examples of ROCs. 

the influence function, as well as experiments on anomaly detection and density estimation 
problems. 

Robust kernel density estimators are nonparametric, making no parametric assumptions 
on the data generating distributions. However, their success is still contingent on certain 
conditions being satisfied. Obviously, the percentage of contaminating data must be less 
than 50%; our experiments examine contamination up to around 25%. In addition, the 
contaminating distribution must be outlying with respect to the nominal distribution. Fur- 
thermore, the anomalous component should not be too concentrated, otherwise it may look 
like a mode of the nominal component. Such assumptions seem necessary given the un- 
supervised nature of the problem, and are implicit in our interpretation of the representer 
theorem and influence functions. 

Although our focus has been on density estimation, in many applications the ultimate 
goal is not to estimate a density, but rather to estimate decision regions. Our methodology 
is immediately applicable to such situations, as evidenced by our experiments on anomaly 
detection. It is only necessary that the kernel be PSD here; the assumption that the kernel 
be nonnegative and integrate to one can clearly be dropped. This allows for the use of more 
general kernels, such as polynomial kernels, or kernels on non-Euclidean domains such as 
strings and trees. The learning problem here could be described as one-class classification 
with contaminated data. 

In future work it would be interesting to investigate asymptotics, the bias- variance trade- 
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method 1 


method 2 




e 


0.00 


0.05 


0.10 


0.15 


0.20 


0.25 


0.30 






Ri 


26 


46 


67 


90 


95 


96 


99 


RKDE 


KDE 


R2 
T 


94 
26 


74 
46 


53 
53 


30 
30 


25 
25 


24 
24 


21 
21 






p- value 


0.06 


0.45 


0.72 


0.09 


0.05 


0.04 


0.03 






Ri 


33 


49 


58 


75 


80 


90 


86 


RKDE 


VKDE 


R2 

T 


87 
33 


71 
49 


62 

58 


45 
45 


40 
40 


30 
30 


34 
34 






p- value 


0.14 


0.56 


0.93 


0.42 


0.28 


0.09 


0.15 






Ri 


38 


70 


79 


91 


95 


96 


99 


VKDE 


KDE 


R2 
T 


82 
38 


50 
50 


41 
41 


29 
29 


25 
25 


24 
24 


21 
21 






p- value 


0.23 


0.60 


0.30 


0.08 


0.05 


0.04 


0.03 



Table 3: The signed-rank statistics of the Wilcoxon signed-rank test using AUG as a per- 
formance measure. If Ri is larger than R2, method 1 is better than method 2. 

off, and the efficiency-robustness trade-off of robust kernel density estimators, as well as the 
impact of different losses and kernels. 

8 Proofs 

We begin with three lemmas and proofs. The first lemma will be used in the proofs of 
Lemma |4] and Theorem [5| the second one in the proof of Lemma [2j and the third one in 
the proof of Theorem [3j 

Lemma 3. Let zi, . . . , z^ be distinct points in M.'^. If K = (A;(zj, Zj))f-^ is positive definite, 
then $(zj) = k{ ■ ,Zj) 's are linearly independent. 
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Proof. J2iLi aj^(zj) = implies 



X]«»^(zi) 



m 



H 



^ai$(zi),X]«J^(2j 



rra m 



i=i 



H 



i=i j=i 



and from positive definiteness of K, qi 



am = 0. 



D 



Lemma 4. Let H be a RKHS associated with a kernel k, and xi, X2, anc? X3 he distinct 



points in 



Assume that K = (fc(xj,Xj))| • -j^ is positive definite. For any g,h £71 with 



g ^ h, ^{xi) — g and <&(xj) — h are linearly independent for some i £ {1, 2, 3}. 

Proof. We will prove the lemma by contradiction. Suppose <J>(xj) — g and <l>(xj) — h are 
linearly dependent for all i = 1,2,3. Then, there exists {ai, (3i) 7^ (0,0) for i = 1,2,3 such 
that 



ai($(xi)-5) + /3i($(xi)-/i) = 
a2($(x2) -g) + /32(^(x2) - h) = 
a3($(x3) -g) + /33(^(x3) -h) = 0. 



(13) 
(14) 
(15) 



Note that Oj + /Jj / since g ^ h. 

First consider the case 02 = 0. This gives h = <l>(x2), and ai 7^ and a^ 7^ 0. Then, 



( 13 ) and ( 14 ) simplify to 



9 



ai + /5i 



/3i 



$(X1)-^$(X2), 

ff = ^(X3) ^(X2), 

"3 "3 

respectively. This is contradiction because <^(xi), <^(x2), and <I>(x3) are linearly independent 
by Lemma [3] and 

$(xi) + ^(X2) ^(X3) = 

ai V"3 Oil J 03 

where (ai + /3i)/ai 7^ 0. 

Now consider the case where 02 7^ 0. Subtracting (14) multiplied by ai from (13) 
multiplied by 02 gives 

(ai/32 - a2l3i)h = -02(01 + /3i)<l>(xi) + 01(02 + P2)^{x2)- 
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In the above equation ai/32— a2/3i / because this imphes a2(ai+/3i) = and ai(a2+/32) = 
0, which, in turn, imphes 02 = 0. Therefore, h can be expressed as h = Ai<I>(xi) + A2$(x2) 
where 



Ai 



Q2(ai + /3i) . _ ai(«2 + /32) 
ai/32 - 02/?! ' ai/32 - a2/3i 



Similarly, from (14) and (15), h = A3<l>(x2) + A4$(x3) where 

02(03 + ^3) 



03(02 + /32) , 

-^3 = 7^ —, M 



"2/33 - "3/^2 02/33 - "3/32 

Therefore, we have h = Ai<l>(xi) + A2'l'(x2) = A3<i>(x2) + A4<l>(x3). Again, from the linear 
independence of $(xi), <I*(x2), and <^(x3), we have Ai = 0, A2 = A3, A4 = 0. However, 
Ai = leads to 02 = 0. 

Therefore, $(xj) — g and <l>(xi) — h are linearly independent for some i G {1, 2, 3}. D 

Lemma 5. Given Xi, . . . , X„, let T>n CH be defined as 

f n n V 

^ i=\ 1=1 '' 

Then, P„ is compact. 
Proof. Define 



A 



{Wl,... ,Wn) G 



Wi > 



n . 



and a mapping W 



W : {wu...,Wn) eA^^Wi-^{y.i) £n. 



i=l 



Note that A is compact, W is continuous, and V^ is the image of A under W. Since the 



continuous image of a compact space is also compact (Munkres, 2000), P„ is compact. D 



8.1 Proof of Lemma [T] 

We begin by calculating the Gateaux differential of J. We consider the two cases: <&(x) 
{g + ah) = and $(x) - {g + ah) / 0. 
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For $(x) -{g + ah) / 0, 
^p{\m^) - (g + ah)\\n) 



d 



$(x) -{g + ah)\\n) ■ ^ll^(x) -{g + ah)\\n 



H^) - {g + ah)\\n) 
^^) - (g + ah)\\n) 



d_ 

da 



|$(x)-(5 + a^^ii2 



n 



^\\^{^)-{g + ah)f^ 



2Jm-^)-{g + ah)rn 



$(x)-(ff + Q^)||^) d_ 

2m^) - {g + ah)\\n 'da 
<^{-^) - {g + ah)\\n) 



l^(x) - gfu - 2($(x) - g, ah) + a 



|$(x)-(5 + a/i)||^ V '^ 

V{m-^) -{g + ah)\\n) ■ (-($(x) -{g + ah), h)^ 



For $(x) -{g + ah) = 0, 
d 



da 



p(||$(x)-(ff + a/ 



Iw 



lim ^^"'^^''^ - (g + (« + '^)^)ll^) - p(II^W - (g + «^)ll^) 



(5-s.O 
lim 

lim - 

<5-!>0 



p(||^/i||^)-p(0) 
5 

Mn) 



lim^^o ^, 
lim^^O 5||/,||_, 



W, 



/i = 



(16) 



(^(||$(x) - (5 + a/i)||w) • (-($(x) - (g + a/i), /i) ) 



(17) 



where the second to the last equality comes from (Al) and the last equality comes from the 
facts that <^(x) - {g + ah) = and v?(0) is well-defined by (A2). 



From (16) and (17), we can conclude that for any g, h £71, and x G M , 

^p{\m^) - {g + ah)\\n) 
= ^{\m^) -ig + ah)\\n) ■ (-($(x) -ig + ah), h)^) 



(18) 
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Therefore, 



d 
6J{g; h) = -T^Jid + "^)L=o 

^ i=l 
1 " f) 

i=l 



a=0 



a=0 



n 

- Y, V>{\M^^) -(9 + ah)\\n) ■ {-{H^^) -ig + ah), h)^) 

1 " 

4 = 1 

n 

-^(^(||$(X,)-<7lk)-(^(X.)-<7),/i 
"" j=i 



n 

i=l 



«=0 



^— 1 in 



The necessary condition for (7 to be a minimizer of J, i.e., g = /rkde, is that dJ{g; h) 
0, V/i G n, which leads to V{g) = 0. 

8.2 Proof of Theorem [T] 

From Lemma [ij VQrkde) = 0, that is, 

I "■ 

- V ^(||<I>(X,) - IrkdeWh) ■ m^i) - Irkde) = 0. 
n ^-^ 

Solving for /rkde, we have /rkde = J27=i m^i'^i) where 

Wi= (j2(pi\MXj) - fRKDEWn)) ■^{\\^{'^i)-fRKDE\\n)- 

Since p is non-decreasing, Wi > 0. Clearly X]"^^ Wi = 1 

8.3 Proof of Lemma [2] 

J is strictly convex on Ti if for any < A < 1, and g,h & Ti with g ^ h 

J{\g + (1 - A)/i) < \J{g) + (1 - A) J(/i). 
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Note that 

n 

J{\g + (1 - \)h) = - Y,p{m^^) - A<7 - (1 - \)h\\n) 



n 



= - f; p(||A(<5(X.) -g) + {l- A)(cI>(X.) - h)U) 

n 

< - Y,p{m0^i) - 9\\n + (1 - A)||$(X,) - h\\n) 

i=l 

< - f; Ap(||<^(Xi) - g\\n) + (1 - A)p(||c|>(X, 



_ _ hu) 

= \J{g) + {l-X)J{h). 

The first inequahty comes from the fact that p is non-decreasing and 

||A($(XO -g) + {l- A)(1>(X,) - h)\\H < X\m^^) - g\\n + (1 - A)||$(X,) - h\\n, 

and the second inequahty comes from the convexity of p. 

Under condition (i), p is strictly convex and thus the second inequality is strict, implying 
J is strictly convex. Under condition (ii), we will show that the first inequality is strict 
using proof by contradiction. Suppose the first inequality holds with equality. Since p is 
strictly increasing, this can happen only if 

||A(cl>(X,) -g) + {l- A)(<I>(X,) - h)\\n = A||<I>(X,) - g\\n + (1 - A)||cl>(X,) - h\\n, 

for i = 1, . . . ,n. Equivalently, it can happen only if ($(Xj) —g) and (<l>(Xj) —h) are linearly 
dependent for all i = 1, . . . , n. However, from n > 3 and positive definiteness of K, there 
exist three distinct Xj's, say Zi, Z2, and Z3 with positive definite K' = (/co-(Zj, Zj))? ■ -^. 
By Lemma Hi it must be the case that for some i G {1,2,3}, (<l>(Zj) — g) and (<l>(Zj) — h) 
are linearly independent. Therefore, the inequality is strict, and thus J is strictly convex. 

8.4 Proof of Theorem H 

First, we will prove the monotone decreasing property of J{f^^'). Given r G M, define 

u{x; r) = p{r) rip{r) -\ — ip{r)x . 

If (/? is nonincreasing, then n is a surrogate function of p, having the following property 



u(r; r) = p{r) (19) 

u{x;r)>p{x), \/x. (20) 



(Huber 1981) 
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Define 



1 " 



1=1 



Note that since ip and if are continuous, Q{- ; • ) is continuous in both arguments. 



From (19) and (20), we have 



1 " 

Q(f(^).fW) = -^u{\MX,)-f(>'^\\n,\m^^)-f'^''^\\n) 

i=l 
1 " 

= -Y,pim^^)-f^'^\\n) 

TJ ^ ^ 



i=l 



J^f(k)^ 



and 



1 " 
Q(9; /^'^) = - 5]^(||'J>(X,) - g\\n, ||^(X,) - /W||^) 

n ■^^ 

= J(<7), ygen 

The next iterate /(^+-^) is the minimizer of (5(5; f^^^) since 

n 



i=l 



n 

argmin^(^(||cI.(X0-/('=)||w)-||<I>(X,)-5lll^ 
sew ^ 

argminQ (5 ;/('')) 



From (21), (22), and (23) 



(21) 



(22) 



(23) 



and thus J{f^^') monotonically decreases at every iteration. Since {J{f^^')}^=i is bounded 
below by 0, it converges. 

Next, we will prove that every limit point /* of {/'^}^x belongs to 5. Since the 
sequence {f^^'}'^^i lies in the compact set Vn (see Theorem hu and Lemma pj), it has a 
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convergent subsequence {f^ '}^i- Let /* be the limit of {f^ '}^i- Again, from (21), 



(22), and (23) 



where the first inequahty comes from the monotone decreasing property of J{f^ ')■ By 
taking the hmit on the both side of the above inequality, we have 



Therefore, 



Q(r;r)<Q(5;r) ,^9^n. 



f* = aicgminQ{g;f*) 

-V v.(ii^(x.)-rii^) 



and thus 



Y,^{m^^)-n\n)■m^^)-n = o. 

i=l 

This implies f* G S. 

Now we will prove \\f^^' — 5||-^ — )• by contradiction. Suppose inf^g^ ||/''^^ — gWn ~^ 0. 
Then, there exists e > such that VET G'H,3k > K with infgg^ \\f^^' — gWu > £• Thus, we 
can construct an increasing sequence of indices {ki}'^^ such that inf^g^ \\f^^^' — gWv. ^ e for 
all/ = 1, 2, ... . Since {f^^'''}'^i lies in the compact set Vn, it has a subsequence converging 
to some /"I", and we can choose j such that \\f^^^' — PWu < e/2- Since p is also a limit 
point of {/''"}^i, /^ G 5. This is a contradiction because 

e<inf||/('=^)-5||w<||/('=^)-/t||w<e/2. 
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8.5 Proof of Theorem H] 



Since the RKDE is given as fRKDEi^] P) = ('^(x), /f)-H) the influence function for the 
RKDE is 

fRKDEi'^', Fs) — fRKDEi'^', F) 



/F(x, X ; fRKDE, F) = hm 

s— >0 



hm 



mx),fF^)n-m^)jF)n 



$(x),hm;^^^ ^ 

s^>0 S 



n 



IFs-If 



and thus we need to find fp = hnis^o 

As we generahze the definition of RKDE from Jrkde to fp, the necessary condition 
V(fRKDE) also generahzes. However, a few things must be taken care of since we are 
deahng with integral instead of summation. Suppose ip and (/? are bounded by B' and B" , 
respectively. Given a probability measure fi, define 



M9) = I pim^)-9\\n)dK^). 



(24) 



From (18) 



d 



— J^(g + a/i)|^^o 



a=0 



a=0 



^^j p(||$(x)-(5 + a/i)||«)dMx] 
||^p(||$(x)-(g + a/.)||^)d/x(x) 
jvim^) -{g + ah)\\n) ■ {-(H^) -{9 + ah), h)^) d^x) 

- j^im^) - g\\n) ■ (^(x) - g, h)^ d/x(x) 

- j(ip{m^) - g\\n) ■ (^(x) -g),hj d^(x). 



«=o 



The exchange of differential and integral is valid (Lang, 1993) since for any fixed g,h G T-L, 
and a G (—1, 1) 

d 



da 



p{m^)-{g + ah)\\n) 



= ifim^) -{g + ah)\\) • |-(cl>(x) -{g + ah), h)^\ 

< B".m^)-ig + ah)\\.\\h\\n 

< S".(||cl>(x)||^ + ||5||^ + 11/^11^). ||/i||^ 

< B"-(T+\\g\\n + \\h\\n)-\\h\\n<oo. 
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Since (/3(||$(x) — g\\-H) ■ (^(x) ~ o) is strongly integrable, i.e., 

y ||^(||<I>(x) - g\\y) ■ ('I>(x) - g) 11^ d/i(x) < i?' < oo, 



its Bochner-integral (Berlinet k, Thomas- Agnan , 2004) 



V,{g) ^ I ^(II^I'Cx) - g\\n) • (^(x) - g) d^^{^) 
is well-defined. Therefore, we have 



6J^{g; h) = -(j H\M^) " 9\\n) ■ (^(x) - g) d^x), h 
= -{V,{g),h)^. 



n 



and V^if^) = 0. 

From the above condition for fp^, we have 



Therefore, 



Then, 



= VFAfFj 

= {l-s)- VfUfJ + sVs, UfJ, ys e [0, 1) 



= lim(l - s) ■ VfUfJ + lims • Vs^,{fF. 



limy^(/i.J. 
s— >0 



= lh'n-{VFAfFj-VF{fF) 
s-5>0 S 



: hm - (1 - s)VF{fFj + sVs^, UfJ - VfUf] 

s->0 S 



:lim - V>(/^J - VfUf) - I^VfUfJ + liraV, if^^ 

s^O S \ I s-s>0 s^O ^ 



:lim - V>(/fJ - Vf{!f) + lim^5^,(/Fj 

: hm - f V>(/fJ - V>(/^)) + hm y.(||cl>(x') - JfA) ■ (^(x') - /fJ 



= lim ^ (^yi.(/Fj - VfUf)) + y^(||^(x') - /fII) • ($(x') - /p). 
where the last equality comes from the facts that Jf^ — >• Jf and continuity of (p. 



(25) 
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Let U denote the mapping fj.*-^ f^. Then, 

s->0 S 

= lim '^'^•> - ^<^' 

s-!>0 S 

U{{l-s)F + s6^,)-U{F) 
= nm 

s-!>0 S 

U{F + si6^^-F))-UiF) 
= hm 

s->0 S 

= 5U{F; 5^1 - F) (26) 

where 6U{P; Q) is the Gateaux differential of f7 at P with increment Q. The first term in 



(25) is 



liml(^^^(/^J-F^(/p)) 



= lim ^ {vf{U{Fs)) - Vf{U{F)) 

= lim V(y^oC/)(F,)-(V>oC/)(F)) 

s-5-0 S \ / 

= lim-( (ypoC/)(F + s((5x'-F)) -(VFoU)iF)\ 
s-5-0 s \ y 

= 5{VfoU){F-5^,-F) 

= 6Vf{U{F)-5U{F-6^>-F)) 

= SVfUf-Jf) (27) 

where we apply the chain rule of Gateaux differential, 5{GoH){u; x) = 5G{H{u); 6H{u; x)), 
in the second to the last equality. Although /f is technically not a Gateaux differential 
since the space of probability distributions is not a vector space, the chain rule still applies. 
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Thus, we only need to find the Gateaux differential of Vp- For g,h £ 7i 



5VF{g; h) = lim -^[Vpig + s ■ h) - Vpig) 



hm V / v9(||c^(x) -g-s- hU) ■ ("^(x) -g-s- h)dF{^) 
- j ^(||$(x) - g\\n) ■ ($(x) - 5)dF(x)^ 

^{\m^)-g-s-h\\n)-^im^)-g\\n) 
^{\M^) - g - s ■ h\\n) ■ s ■ h\ dF{^) 

I lim ^ (^ipim^) -g-s- h\\n) - (/5(||$(x) - g\\n)^ ■ ($(x) - g)dFi^) 



lim - 



(<I>(x) - g)dFi^) 



lim - 



-/i- / lim(^(||<l>(x)-5-s./i||^)dF(x) 

V^'dl^lx) - 5||w) • ||<J>(x) - g\\n - V'(ll^(x) - g\\n) {h, '^(x) - g)n 



|^(x)-<7"' 



w 



l^(x)-5llw 



•(<I>(x)-5)dF(x) 

-/i-y"(^(ii<i>(x)-5iiw)rfF(x) 

where in the last equality, we use the fact 
d 



-V.(||cl>(x) -g-s.h\\n) = ^'(m^) -9-s.h\\n)- ?1"\ ' ' " ^^"^ 
9s \\<^[x) - g - s ■ h\\n 



and 



(/3'(x) 



d ^(x) ip'{x)x — ip{x) 



dx X 



x^ 



(28) 



The exchange of limit and integral is valid due to the dominated convergence theorem since 
under the assumption that (p is bounded and Lipschitz continuous with Lipschitz constant 



lifiW^ix) -g-s-h\\)\< oo, Vx 



and 



^{\m^)-g-s.h\\n)-ip{\m^)-g\ 



n. 



■ ^U) 



n 



|(/7(||$(x) -g-s- h\\n) - 95(||^(x) - g\\n)\ • ||^(x) - gWu 



< -L-\\s 
s 

< L- 



n ■ {\M^)\\n + WgWn) 
w (ll^(x)||w + lbllw)<oo, Vx. 
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By combining ([25|), p6|), ([27|, and ([28|, we have 



+ 



^{\M^)-fF\\)dFJ.fp 



II^(x)-/fP 

(<I>(x')-/f)-^(||^(x')-/f 



^•(Z(||$(x)-/^||).($(x)-/^))dF(x) 



where q{x) = xil)'{x) — tp{x). 



8.6 Proof of Theorem [5] 



With Fn instead of F, (12) becomes 



1 " \ 



1 " /(/p, ,$(Xi)-/F )^ 



i=l 



I^(xO-/f„ 



= ($(x')-/k)-v.(||<i>(x')-/fJ|). 

Let r, = ||$(Xi) - /pJI, r' = ||<&(x') - /fJ|, 7 = ELi ^(^0 and 



(29) 



rL rp^ 



Then, (29) simphfies to 

n 



i=l 



Since /i?„ = X]"=i t(;j$(Xj), we can see that fp^ has a form of Yll=i oii^O^i) + Q;'<I>(x'). By 
substituting this, we have 



7 



j=l i=l ^ fc=l ^ 

k=l ^ 



Since K' is positive definite, $(Xj)'s and <I>(x') are hnearly independent (see Lemma p|. 
Therefore, by comparing the coefficients of the $(Xj)'s and <I>(x') in both sides, we have 



7 • "i + dj -Wj-I2^di]= -Wj—y- ■ 



n 



70;' = n ■ ^{r'). 



(30) 
(31) 
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From (31), a' = rnp{r')l-^. Let qi = q{ri)/rf and $(Xj) - fF„ = Ylk=i Wk,i^P^k) where 



WkA 



-Wk , k j^i 
1 — Wk , k = i. 



Then, 



i 



fF„,HX,)-fF„ 



n 



gY 5^ a,<l>(X,) + a'cD(x'), ^ w;fc,i<l>(X, 



i=i 



fc=i 



n 



(n n n \ 

y^ y^ ajt(;fc,iA;^(Xj, Xfc) + a' ^^ Wk,ik^{x' , X^) j 

7 = 1 fc=l fc=l ''^ 



= gi(ei - wf^KoL + g^a' • (e^ - w)'^k' 
= gi(ei-w)^(i^a + a'k') 

where ii' := (A;o-(Xj, Xj))"- ^^ is a kernel matrix, ej denotes the ith standard basis vector, 
and k' = [A;(j(x',Xi, . . . , /co-(x',X„)]^. By letting Q = diag{[qi,... ,g„]), 

d = Q • (/„ - lw'^){Ka + a ■ k'). 



Thus, (30) can be expressed in matrix- vector form. 



7Q + Q • (/„ - 1 • w^){Ka + q' • k') - w • (l^Q •(/„-!• w^){Ka + a' ■ k')) 

= — n • W99(r'). 

Thus, Q can be found solving the following linear system of equations, 

7/„ + (/„-!• w^)^Q • (In - 1 • w^) • k\cx 
-- -n ■ (^(r')w - a' {In - 1 • w^)^Q •(/„-!• w^)k'. 



Therefore, 



/F(x, x'; /Hi,DS, F„) = ( <l>(x), /f„ 



W 



$(x),J^ai$(Xi)+a'$(x') 
y^ aiA;CT(x, Xj) + a'k^{x, x'). 



H 



i=l 
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The condition lims_j.o /f„,s = /f„ is implied by the strict convexity of J. Given Xi, . . . , X„ 
and x', define Pn+i as in Lemma^ From Theoremfll /f„,s and fp^^ are in Pji+i- With the 



definition in (24) 



Jf„.M = J Pi\m^) - g\\n)dFnA^) 



(l-s) 



n 



Y, Pim^^) - 9\\n) + s ■ p(||$(x') - g\\n). 



1=1 



Note that Jf„,s uniformly converges to J on Vn+i, i.e, supgg-p„+i l'^i^n,s(5') ~ "^(5)1 — ^ as 
s — )• 0, since for any g G Pn+i 



kF„,(5)-j(5)i 

^ '^ ^ p{\mx,) -g\\n) + s- p{\m^') -g\\H)--Yl P(ll^(XO - 9\\n] 



n 



i=l 



n 



i=l 



Y^ p(||$(X,) - g\\n) + s ■ pim^') - g\\n) 



n ^ — ^ 



j=l 



= 2s • p{2t) 

where in the inequality we use the fact that p is nondecreasing and 

\Mx)-g\\n < \\^{x)\\ + \\g\\n 
< 2t. 

since g G Vn+i, and by the triangle inequality. 

Now, let e > and Be^fPn) C "H be the open ball centered at fp^ with radius e. Since 
T>^n+i — T^n+i \ BeifPn) is also compact, inf^gxi^ ^ J{g) is attained by some g* S ^^+i 



by the extreme value theorem (Adams &: Pranzosa, 2008). Since fp^ is unique, M, 



Jia*) - JU'fJ > 0. For sufficiently smah s, supggx)„+i l'^Fn,.(5') - -^(s)! < Mj2 and thus 

Jia) - ^ < -/f„,, (5) < Jia) + ^ , V5 G Pn+1 . 
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Therefore, 

Me 

jjf^! Jf„,s{9)> inf J{g)- — 
= AfF.) + M. - ^ 

> JF„,sifF„) 

Since the minimum of Jf„^s is not attained on I^^_(_i, /f„,s £ BeifPn)- Since e is arbitrary, 
hms^o/F„,s = /-Fu- 
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